Conversation

@siramok (Collaborator) commented Sep 19, 2025

Resolves bullets in #1513.

  • Implements support for computing multiple energy group outputs within one rover execute.
  • Converts ROVER_ERROR calls to ASCENT_LOG_ERROR calls, which correctly throw exceptions.
  • Adds several tests, including one that validates MPI support.
  • Improves and adds parallel error checking.

Lots still TODO:

  • Verify that this approach is what we want and mean by multiple energy group support. It seems to be.
  • Add more tests, particularly MPI tests (I hadn't tried it with MPI yet). It seems to work fine with MPI.
  • Add validation to ensure that, when there is more than one energy group, the number of absorption and emission fields is the same. We now check for this in at least two places.
  • Each domain is only traced once, but there seems to be some modest overhead compared to doing multiple rover execute calls with one energy group at a time. My guess is that this comes from deinterleaving the optical depth and intensity fields at the end (the access pattern is probably not cache friendly; see the sketch after this list). There's probably a smarter way to deal with it. Even so, the performance improvement is significant.
  • Some of the baseline images using the image_topo topology seemed to have abnormally large extents. This turned out to be due to slicing on a boundary edge at z = 0; bumping that up to z = 0.001 produces a render with the expected extents.
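
For context on that deinterleaving pass, here is a minimal sketch of the suspected access pattern; the buffer layout and names are illustrative assumptions, not rover's actual code:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical layout: one interleaved buffer per trace, holding
// [optical_depth, intensity] pairs for every group at every pixel.
// Splitting it into one contiguous image per group makes the writes
// stride across 2 * num_groups output buffers -- the suspected
// cache-unfriendly access pattern.
void deinterleave(const std::vector<float> &interleaved, // size: num_pixels * num_groups * 2
                  std::size_t num_pixels,
                  std::size_t num_groups,
                  std::vector<std::vector<float>> &optical_depth, // one image per group
                  std::vector<std::vector<float>> &intensity)     // one image per group
{
  optical_depth.assign(num_groups, std::vector<float>(num_pixels));
  intensity.assign(num_groups, std::vector<float>(num_pixels));
  for (std::size_t p = 0; p < num_pixels; ++p)
  {
    const std::size_t base = p * num_groups * 2;
    for (std::size_t g = 0; g < num_groups; ++g)
    {
      // Sequential reads, but scattered writes: each (g, p) pair lands
      // in a different output image.
      optical_depth[g][p] = interleaved[base + 2 * g];
      intensity[g][p]     = interleaved[base + 2 * g + 1];
    }
  }
}
```

A group-major traversal (or emitting group-major buffers during the trace itself) would keep the writes sequential, at the cost of strided reads.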

@JustinPrivitera (Member) left a comment:

This is really fantastic work. Thanks so much for this. I just had a few clarifying comments and questions.

Member commented:

Why is this zoomed in like this?

@siramok (Collaborator, Author) commented Sep 24, 2025

As best I can tell, it seems to be due to using z = 0 in the slice. Using a z value close to 0 (0.1, 0.01, or 0.001) gives the same legend values as z = 0 but is not zoomed out, so that's my current "fix" for this. Perhaps this is an ascent bug when slicing precisely on a boundary edge?

Member commented:

Part of me wonders if we should open a ticket, but at the same time, it seems questionable in general to slice on a boundary edge. For the VisIt tests, I calculated the midpoints of the energy groups to take the slices. Glad you got this figured out.

@siramok (Collaborator, Author) commented:

I guess we can defer to @cyrush on whether or not this counts as a bug (context: getting no output when slicing precisely on a boundary). If so, I'd be happy to open an issue.

To your second point, I may as well change the current implementation to use the midpoints also. I don't think it will make a difference in the baselines, but consistency with VisIt is good.

@siramok (Collaborator, Author) commented:

Forgot to follow up, but I did switch to slicing at the midpoints instead.
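
For illustration, a minimal sketch of what a midpoint slice could look like using Ascent's slice filter; the assumption that the groups sit in unit-thick layers along z (so group g's midpoint is g + 0.5) is mine for illustration, not taken from this PR:

```cpp
#include <conduit.hpp>

// Build an Ascent slice pipeline through the midpoint of energy group g,
// assuming (for illustration only) that the groups are stacked as
// unit-thick layers along z, so group g spans [g, g + 1] and its midpoint
// is g + 0.5 -- safely away from any layer boundary.
conduit::Node make_midpoint_slice(int g)
{
  conduit::Node pipelines;
  pipelines["pl1/f1/type"] = "slice";
  pipelines["pl1/f1/params/point/x"]  = 0.0;
  pipelines["pl1/f1/params/point/y"]  = 0.0;
  pipelines["pl1/f1/params/point/z"]  = g + 0.5; // midpoint, not a boundary
  pipelines["pl1/f1/params/normal/x"] = 0.0;
  pipelines["pl1/f1/params/normal/y"] = 0.0;
  pipelines["pl1/f1/params/normal/z"] = 1.0;
  return pipelines;
}
```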

Member commented:

I don't think slicing on a boundary is an important case to worry about. I like using midpoints.

@siramok (Collaborator, Author) commented Sep 24, 2025

Failing Windows CI seems to be unrelated to my latest commit.

The next thing I need to do is implement those additional tests, particularly to make sure that this works with MPI. As for my comment about performance: I realized I was going by the total test time, which includes the time spent creating the additional fields. I need to measure just the ascent execute calls for a proper comparison.
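
For reference, a minimal sketch of that measurement, timing only the ascent.execute() call; everything here besides the Ascent API is assumed:

```cpp
#include <ascent.hpp>
#include <chrono>
#include <iostream>

// Time only the ascent.execute() call, excluding the cost of building the
// extra absorption/emission fields; `a` and `actions` are assumed to be
// set up elsewhere.
double time_execute(ascent::Ascent &a, const conduit::Node &actions)
{
  const auto t0 = std::chrono::steady_clock::now();
  a.execute(actions); // the rover trace happens in here
  const auto t1 = std::chrono::steady_clock::now();
  const double seconds = std::chrono::duration<double>(t1 - t0).count();
  std::cout << "ascent.execute took " << seconds << " s\n";
  return seconds;
}
```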

@siramok (Collaborator, Author) commented Sep 25, 2025

Seems to work out of the box with MPI. Separately, I realized that ROVER_ERROR wasn't actually throwing an exception, so I've converted those into ASCENT_LOG_ERROR calls, which do throw. One benefit is that we can actually test for errors now, like making sure that we error out if absorption and emission don't have the same number of components. The other benefit is that rover won't just keep running until it or viskores segfaults.
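
As a sketch of the kind of test this enables (the two helper functions are hypothetical, and the exact exception type is deliberately left unasserted):

```cpp
#include <ascent.hpp>
#include <gtest/gtest.h>

// Hypothetical helpers, not part of this PR: build publishable data with a
// mismatched number of absorption and emission bins, plus matching actions.
conduit::Node build_mismatched_data();
conduit::Node build_rover_actions();

TEST(rover_multi_group, mismatched_bins_throw)
{
  ascent::Ascent a;
  a.open();
  a.publish(build_mismatched_data()); // e.g. 3 absorption bins, 2 emission bins
  // The mismatch should now surface as a catchable exception instead of
  // rover running on toward a segfault; the exact exception type is
  // deliberately not asserted here.
  EXPECT_ANY_THROW(a.execute(build_rover_actions()));
  a.close();
}
```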

@siramok (Collaborator, Author) commented Sep 25, 2025

And finally, after looking at just the ascent.execute time to run rover, the performance uplift from multi-group support seems to be great. I measured an average of 0.3302 seconds when tracing curv3d with one energy group; extrapolating, that would be about 0.9906 seconds to trace 3 separate curv3d energy groups one at a time (the only way that was possible before this PR). I then measured an average of 0.5617 seconds to trace all 3 energy groups at once with this PR, so the combined trace takes roughly 57% of the one-at-a-time total, about a 43% reduction in runtime (a 1.76x speedup). Using the same methodology but with 1024x1024 traces instead of 200x200, the improvement is even larger, closer to a 66% reduction.

@siramok changed the title from "WIP: Add support for tracing multiple energy groups" to "Add support for tracing multiple energy groups in Rover" on Sep 25, 2025
@siramok marked this pull request as ready for review on September 25, 2025

```cpp
if (num_absorption_bins != num_emission_bins)
{
  ASCENT_LOG_ERROR("Error - Engine::get_num_energy_groups: number of energy groups in absorption field ("
  // ... (remainder of the hunk truncated in the diff view)
```
Member commented:

I wonder if we need parallel error checking for a case like this. If this gets called by multiple ranks and one fails, then we are in trouble. Although it's very unlikely we would have a mismatch here.

@siramok (Collaborator, Author) commented Oct 20, 2025

I solved this, but it was a slightly more invasive change than expected, so let me know what you think. Adding the parallel error check in place (inside engine.cpp) didn't work because it broke the MPI test that puts all of the data on rank 0: ranks with no data never reached the MPI_Allreduce, causing rank 0 to hang. I was admittedly pleased to see that ranks with nothing to do don't bother with the setup work required before tracing rays.

My workaround was to move the parallel error check to TypedScheduler, at a point we know every rank will reach and execute. We still check for a mismatched number of fields inside engine.cpp, but instead of throwing right away, detecting a mismatch sets a member variable. Then, inside TypedScheduler, each rank loops over its domains and checks whether any of them had a mismatch. Ranks with no domains have nothing to iterate over and therefore report no mismatch. A sketch of the shape follows below.
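
Roughly, the deferred check looks like this (a minimal sketch with hypothetical names; the real code sets a member variable in engine.cpp and loops over domains in TypedScheduler):

```cpp
#include <mpi.h>
#include <stdexcept>
#include <vector>

// Deferred, collective mismatch check: every rank calls this at the same
// point, so ranks that own zero domains still participate in the reduction
// and nobody hangs. `domain_mismatch_flags` stands in for the member
// variable set per-domain in engine.cpp.
void check_mismatch_across_ranks(const std::vector<int> &domain_mismatch_flags,
                                 MPI_Comm comm)
{
  int local_error = 0;
  for (int flag : domain_mismatch_flags) // empty on ranks with no domains
  {
    local_error |= flag;
  }

  int global_error = 0;
  MPI_Allreduce(&local_error, &global_error, 1, MPI_INT, MPI_MAX, comm);

  if (global_error != 0)
  {
    // The real code would call ASCENT_LOG_ERROR here, which throws.
    throw std::runtime_error("mismatched absorption/emission bin counts");
  }
}
```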

I don't think we currently have a test that exercises this, but I could come up with one if you'd like.

Member commented:

It is always best to make tests to capture all the cases we can, but we aren't always good about it. If you want to add a test you can but I'm not even sure how you would cause this. Num absorption bins would need to be different than num emission bins on only some ranks. I imagine we would hit a Conduit Relay reading error before we ever made it here.

@JustinPrivitera (Member) left a comment:

Just some minor things about parallel error checking and MPI tests.

@JustinPrivitera (Member) left a comment:

Thanks so much for getting this in!

@siramok merged commit d731dc5 into develop on Oct 20, 2025
19 checks passed
@siramok deleted the task/siramok/09_18_25/multiple_groups branch on October 20, 2025